Group 3
December 4, 2017
The objectives were to create summaries and visualizations of how the dependent variable is associated with the different independent variables. Our goal was to develop different models to analyze these associations in the data. Our work included:
If we can discover specific characteristics of the drugs that are associated with effectiveness against TB, this will help the researchers understand varying drug mechanisms in the body using a mouse model.
The dependent variables tested included drug concentrations in the lung tissue and spleen. The independent variables included several in vivo (mouse model) and in vitro tests that were performed by the TB research group.
## Idea development:
So far, we have focused on getting working prototypes, without making sure they’re error-proof and robust to a user doing something non-standard. Identify 3 things a user could do that could make your functions “break” (i.e., either return an error message or return something other than what you hope they will):
The functions use the efficacy_summary dataframe, with column names including: “drug”, “dosage”, “dose_int”, “level”, “PLA”, “ULU”, “RIM”, “OCS”, “ICS”, “SLU”, “SLE”, “ELU”, “ESP”, “cLogP”, “huPPB”, “muPPB”, “MIC_Erdman”, “MICserumErd”, “MIC_Rv”, “Caseum_binding”, and “MacUptake”.
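Because the functions depend on these exact column names, a small guard run before modelling could catch a mismatch early. The sketch below is only an illustration: `check_efficacy_columns()` is a hypothetical helper that is not part of the original functions, and the toy data frame is made up.

```r
# Column names the functions expect (taken from the list of column names)
required_cols <- c("drug", "dosage", "dose_int", "level", "PLA", "ULU", "RIM",
                   "OCS", "ICS", "SLU", "SLE", "ELU", "ESP", "cLogP", "huPPB",
                   "muPPB", "MIC_Erdman", "MICserumErd", "MIC_Rv",
                   "Caseum_binding", "MacUptake")

# Hypothetical helper: stop with a readable message if any column is missing
check_efficacy_columns <- function(data, required = required_cols) {
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ",
         paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy input that lacks most of the expected columns
toy <- data.frame(drug = "DRUG_A", dosage = 50)

# Capture the error message instead of halting the script
result <- tryCatch(check_efficacy_columns(toy),
                   error = function(e) conditionMessage(e))
print(result)
```

Calling a check like this at the top of each modelling function would turn a renamed or missing column into a clear message rather than a failure deep inside the pipeline.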
Image retrieved from: https://www.pinterest.com/pin/233624299389735946/
```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggthemes)  # provides theme_few()

linear_model <- function(peak_trough, dep_var,
                         data = efficacy_summary) {
  # Reshape to long format: one row per drug and independent variable
  function_data <- data %>%
    filter(level == peak_trough) %>%
    gather(key = independent_var, value = indep_measure,
           -drug, -dosage, -dose_int, -level, -ELU, -ESP,
           na.rm = TRUE) %>%
    select(drug, dosage, dose_int, level, all_of(dep_var),
           indep_measure, independent_var)

  # Copy the chosen dependent variable ("ELU" or "ESP") into `vect`
  function_data$vect <- function_data[[dep_var]]

  # Linear model of the dependent variable on one standardized predictor
  model_function <- function(data) {
    lm(vect ~ scale(indep_measure), data = data)
  }

  # Fit one model per independent variable and dose interval
  estimate_results <- function_data %>%
    group_by(independent_var, dose_int) %>%
    nest() %>%
    ungroup() %>%  # newer tidyr keeps the grouping after nest()
    mutate(mod_results = purrr::map(data, model_function)) %>%
    mutate(mod_coefs = purrr::map(mod_results, broom::tidy)) %>%
    select(independent_var, dose_int, mod_results, mod_coefs) %>%
    unnest(mod_coefs) %>%
    filter(term == "scale(indep_measure)")

  # Plot the coefficients; point size reflects 1 / standard error
  coef_plot <- estimate_results %>%
    mutate(independent_var = forcats::fct_reorder(
      independent_var, estimate, .fun = max)) %>%
    rename(Dose_Interval = dose_int) %>%
    ggplot(aes(x = estimate, y = independent_var,
               color = Dose_Interval)) +
    geom_point(aes(size = 1 / std.error)) +
    scale_size_continuous(guide = "none") +
    theme_few() +
    ggtitle(label = paste("Linear model coefficients as function",
                          "of independent variables,\nby drug dose and",
                          "model uncertainty"),
            subtitle = paste("Smaller points have more uncertainty",
                             "than larger points")) +
    geom_vline(xintercept = 0, color = "cornflower blue")

  coef_plot
}
```

linear_model arguments:
- data = efficacy_summary (default)
- dep_var options: “ELU” (lung efficacy) or “ESP” (spleen efficacy)
- peak_trough options: “Cmax” or “Trough”

```r
# Sample code for function linear_model (Cmax and ELU)
linear_model(peak_trough = "Cmax", dep_var = "ELU")

# Sample code for function linear_model (Cmax and ESP)
linear_model(peak_trough = "Cmax", dep_var = "ESP")
```

Coefficients that are far right or far left indicate the strongest associations between independent and dependent variables.
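If the coefficient is negative, as it is with MacUptake in the ELU linear regression model, the independent variable has a negative relationship with the dependent variable. Because each independent variable is standardized with scale(), the coefficient is the change in ELU per one standard deviation increase in that variable. For example, a coefficient of -0.5 for MacUptake would mean that ELU decreases by 0.5 units for every one standard deviation increase in MacUptake. The size of each point is proportional to 1 / std.error, so larger points mark more certain coefficient estimates. These estimates may change as more data are collected for each drug.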
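To make the interpretation of a coefficient on a scale()-transformed predictor concrete, here is a minimal sketch on simulated data. The variables `x` and `y` and all numbers are made up for illustration and are not from the TB dataset. It checks that the standardized slope equals the raw slope multiplied by the standard deviation of the predictor.

```r
set.seed(1)

# Simulated predictor and response (illustrative values only)
x <- rnorm(30, mean = 10, sd = 4)
y <- 2 - 0.5 * x + rnorm(30)

# Same model fitted on the raw and on the standardized predictor
raw_fit <- lm(y ~ x)
scaled_fit <- lm(y ~ scale(x))

raw_slope <- unname(coef(raw_fit)["x"])
scaled_slope <- unname(coef(scaled_fit)["scale(x)"])

# The standardized slope is the raw slope times sd(x)
slopes_agree <- isTRUE(all.equal(scaled_slope, raw_slope * sd(x)))
print(slopes_agree)
```

Standardizing puts predictors measured in very different units on a common scale, which is what allows their coefficients to be compared side by side in one plot.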
```r
# Core rpart call; min_split and min_bucket are the regression_tree() arguments.
# cp = -1 turns off complexity-based pruning, so the tree keeps splitting
# until the minsplit / minbucket limits stop it.
rpart(ELU ~ drug + dosage + level +
        plasma + `Uninvolved lung` + `Rim (of Lesion)` +
        `Outer Caseum` + `Inner Caseum` +
        `Standard Lung` + `Standard Lesion` + cLogP +
        `Human Plasma Binding` +
        `Mouse Plasma Binding` + `MIC Erdman Strain` +
        `MIC Erdman Strain with Serum` +
        `MIC rv strain` + `Caseum binding` +
        `Macrophage Uptake (Ratio)`,
      data = function_data,
      control = rpart.control(cp = -1,
                              minsplit = min_split,
                              minbucket = min_bucket))
```

regression_tree arguments:
- dep_var options: “ELU” (lung efficacy) or “ESP” (spleen efficacy)
- min_split: numeric input indicating the minimum number of observations for a split to be attempted
- min_bucket: numeric input indicating the minimum number of observations in a terminal node
- data = efficacy_summary (default; must use this to run properly)

```r
regression_tree(dep_var = "ELU", min_split = 8,
                min_bucket = 6)
```

- … min_bucket = 4).
- … min_split = 8).
- … min_split or the min_bucket parameters are fulfilled for each node.

Background

We want to predict our outcome using the variables we have in front of us. LASSO is the next generation of step-wise regression and can handle more variables than samples.
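As a rough sketch of what a LASSO fit optimizes, the standard lasso objective for a continuous response can be written as below. This form is not stated in the original analysis, and the symbols $n$ (number of observations), $p$ (number of predictors), $\beta_0$, $\beta$, and $\lambda$ are introduced here only for illustration.

$$
\min_{\beta_0,\,\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
$$

The absolute-value penalty can shrink some coefficients exactly to zero, which is how LASSO selects variables and why it can be fitted even with more variables than samples. A larger $\lambda$ sets more coefficients to zero. In glmnet, the `s` argument of `coef()` is the value of $\lambda$ at which the coefficients are reported.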
```r
library(dplyr)
library(glmnet)

LASSO_model <- function(dep_var, dose, df = efficacy_summary) {
  # Keep complete rows, numeric columns only, for the requested dosage
  data <- na.omit(df) %>%
    select_if(is.numeric) %>%
    filter(dosage == dose)

  # Build response and predictors from the filtered data (not from df)
  response <- data %>%
    select(all_of(dep_var))
  predictors <- data %>%
    select(c("PLA", "ULU", "RIM", "OCS", "ICS", "SLU", "SLE", "cLogP",
             "huPPB", "muPPB", "MIC_Erdman", "MICserumErd",
             "MIC_Rv", "Caseum_binding", "MacUptake"))

  y <- as.numeric(unlist(response))
  x <- as.matrix(predictors)

  fit <- glmnet(x, y)

  # Coefficients at penalty value s = 0.1, returned as a data frame
  coeff <- coef(fit, s = 0.1)
  as.data.frame(as.matrix(coeff))
}
```

```r
LASSO_model(dep_var = "ELU", dose = 50)
```

| predictor | coeff |
|---|---|
| (Intercept) | 1.2911027 |
| cLogP | 0.2908215 |
| muPPB | 0.0049209 |
```r
efficacy.rf <- randomForest(ELU ~ ., data = dataset,
                            na.action = na.roughfix,
                            ntree = 500,
                            importance = TRUE)
```

These functions may be prone to several errors if:
Input datasets have a very low or very high number of observations
Missing data are recorded differently. We noticed that in the individual drug data, the “NA” (missing data) entries for spleen_efficacy had a space before the NA. This type of variation in how missing data is recorded could cause problems for the functions.
Drug names or codes change
New independent variables or measurements are added to the dataframe
The dataset provided included two dose frequency combinations, 50 BID and 100 QD. If these dose and frequency combinations change it could cause problems with some of the functions.
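The missing-data issue noted above (an “NA” recorded with a leading space) could be handled by normalizing the raw values before they reach the modelling functions. The sketch below is one possible approach: `clean_numeric()` is a hypothetical helper and the input vector is invented for illustration.

```r
# Hypothetical helper: trim whitespace, treat "" and "NA" as missing,
# then convert the remaining values to numeric
clean_numeric <- function(x) {
  x <- trimws(as.character(x))
  x[x %in% c("", "NA")] <- NA_character_
  as.numeric(x)
}

# Made-up raw values, including an " NA" with a leading space
spleen_raw <- c("0.8", " NA", "1.1 ", "NA")
spleen_clean <- clean_numeric(spleen_raw)
print(spleen_clean)
```

Applying one cleaning step to every measurement column means the downstream functions see a single, consistent representation of missing values.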
The next step is to include more data representing more drugs
Validating the models (for RandomForest and Lasso) would help us understand the predictive power of the model to determine drug efficacy. The data can be subsetted and tested.
A function could be written that runs all of the models and outputs the coefficients
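One way to sketch the validation step mentioned above is a simple hold-out split. In this illustration the data are simulated, the column names `x1`, `x2`, and `y` are hypothetical, and `lm()` stands in for the RandomForest and Lasso fits so that the example runs with base R only.

```r
set.seed(42)

# Simulated data (not the TB dataset): two predictors and one outcome
n <- 40
toy <- data.frame(x1 = rnorm(n), x2 = runif(n))
toy$y <- 1 + 0.8 * toy$x1 - 0.4 * toy$x2 + rnorm(n, sd = 0.3)

# Hold out 25% of the rows for testing
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Fit on the training rows only (lm() is a stand-in for the real models)
fit <- lm(y ~ x1 + x2, data = train)

# Root mean squared error on rows the model has not seen
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$y - pred)^2))
print(rmse)
```

With a dataset as small as the one described here, repeating the split several times or using cross-validation would probably give a more stable estimate than a single hold-out set.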